
Support GigaChat3 #995

Merged

ikawrakow merged 3 commits into main from ik/support_gigachat
Nov 24, 2025

Conversation

@ikawrakow
Owner

This PR adds support for GigaChat3 and closes #994

The model uses the same MLA attention mechanism as DeepSeek, but with a twist: the value head length is not 128 as in DeepSeek models, but 192. I guess everybody feels the need to make a creative alteration to an existing architecture.
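A minimal shape sketch of why this works at all: in attention, the output width simply follows the value head size, so nothing forces it to match the q/k head size. The 128-vs-192 value sizes come from the PR text; the q/k head size (128 nope + 64 rope, DeepSeek-style) is an assumption for illustration, not read from the GigaChat3 config.

```python
import numpy as np

# Sketch: attention where the value head size differs from the q/k head
# size, as in GigaChat3's MLA variant. Dimensions other than d_v are
# assumed here purely for illustration.
def attention(d_v, n_head=4, n_tok=8, d_qk=128 + 64, seed=0):
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_head, n_tok, d_qk))
    k = rng.standard_normal((n_head, n_tok, d_qk))
    v = rng.standard_normal((n_head, n_tok, d_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_qk)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v  # output width is d_v, independent of d_qk

out_deepseek = attention(d_v=128)   # DeepSeek value head size
out_gigachat = attention(d_v=192)   # GigaChat3's altered value head size
assert out_deepseek.shape == (4, 8, 128)
assert out_gigachat.shape == (4, 8, 192)
```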

Here are some sweep-bench results for the 10B-A1.8B variant (https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B-bf16) quantized as Q8_0.

ik_llama.cpp, RTX-4080

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 256 |     0 |  0.167 | 12279.50 |  1.131 |   226.45 |
| 2048 | 256 |  2048 |  0.181 | 11321.48 |  1.159 |   220.96 |
| 2048 | 256 |  4096 |  0.226 |  9059.58 |  1.199 |   213.53 |
| 2048 | 256 |  6144 |  0.272 |  7531.24 |  1.231 |   207.89 |
| 2048 | 256 |  8192 |  0.317 |  6452.18 |  1.348 |   189.97 |
| 2048 | 256 | 10240 |  0.364 |  5619.66 |  1.380 |   185.54 |
| 2048 | 256 | 12288 |  0.409 |  5009.50 |  1.383 |   185.10 |
| 2048 | 256 | 14336 |  0.455 |  4499.59 |  1.388 |   184.42 |
| 2048 | 256 | 16384 |  0.500 |  4092.72 |  1.476 |   173.42 |
| 2048 | 256 | 18432 |  0.549 |  3729.05 |  1.511 |   169.48 |
| 2048 | 256 | 20480 |  0.596 |  3435.35 |  1.521 |   168.27 |
| 2048 | 256 | 22528 |  0.646 |  3168.05 |  1.521 |   168.28 |
| 2048 | 256 | 24576 |  0.695 |  2947.77 |  1.606 |   159.37 |
| 2048 | 256 | 26624 |  0.743 |  2757.12 |  1.644 |   155.68 |
| 2048 | 256 | 28672 |  0.796 |  2572.42 |  1.651 |   155.07 |
| 2048 | 256 | 30720 |  0.838 |  2444.78 |  1.654 |   154.81 |

llama.cpp, RTX-4080

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 256 |     0 |  0.268 |  7650.07 |  1.285 |   199.20 |
| 2048 | 256 |  2048 |  0.335 |  6120.87 |  1.325 |   193.18 |
| 2048 | 256 |  4096 |  0.444 |  4614.00 |  1.355 |   188.91 |
| 2048 | 256 |  6144 |  0.555 |  3688.46 |  1.395 |   183.52 |
| 2048 | 256 |  8192 |  0.654 |  3131.34 |  1.435 |   178.44 |
| 2048 | 256 | 10240 |  0.739 |  2770.27 |  1.575 |   162.50 |
| 2048 | 256 | 12288 |  0.832 |  2461.58 |  1.597 |   160.33 |
| 2048 | 256 | 14336 |  0.946 |  2165.80 |  1.610 |   159.01 |
| 2048 | 256 | 16384 |  1.045 |  1960.65 |  1.625 |   157.50 |
| 2048 | 256 | 18432 |  1.127 |  1816.52 |  1.637 |   156.40 |
| 2048 | 256 | 20480 |  1.238 |  1654.25 |  1.762 |   145.30 |
| 2048 | 256 | 22528 |  1.335 |  1533.68 |  1.775 |   144.23 |
| 2048 | 256 | 24576 |  1.435 |  1427.38 |  1.790 |   143.05 |
| 2048 | 256 | 26624 |  1.529 |  1339.09 |  1.797 |   142.47 |
| 2048 | 256 | 28672 |  1.615 |  1268.26 |  1.812 |   141.32 |
| 2048 | 256 | 30720 |  1.715 |  1193.94 |  1.936 |   132.20 |
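Reading the end-points off the two GPU tables above, the speedup ratios can be computed directly (a quick sanity-check script, not part of the PR):

```python
# Speedups of ik_llama.cpp over llama.cpp on the RTX-4080, taken from
# the first and last rows of the two tables above.
s_pp = {"N_KV=0": (12279.50, 7650.07), "N_KV=30720": (2444.78, 1193.94)}
s_tg = {"N_KV=0": (226.45, 199.20), "N_KV=30720": (154.81, 132.20)}

for name, table in (("PP", s_pp), ("TG", s_tg)):
    for kv, (ik, mainline) in table.items():
        print(f"{name} speedup at {kv}: {ik / mainline:.2f}x")
# PP goes from ~1.6x with an empty cache to ~2x at 30k tokens.
```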

ik_llama.cpp, CPU-only, Ryzen-7950X

|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 2048 | 128 |     0 |  2.373 |   862.94 |  4.032 |    31.75 |
| 2048 | 128 |  2048 |  3.546 |   577.61 |  4.295 |    29.80 |
| 2048 | 128 |  4096 |  4.769 |   429.44 |  4.487 |    28.53 |
| 2048 | 128 |  6144 |  6.052 |   338.42 |  4.671 |    27.40 |
| 2048 | 128 |  8192 |  8.031 |   255.01 |  4.853 |    26.37 |
| 2048 | 128 | 10240 |  9.201 |   222.58 |  5.081 |    25.19 |
| 2048 | 128 | 12288 | 10.847 |   188.81 |  5.229 |    24.48 |
| 2048 | 128 | 14336 | 12.062 |   169.79 |  5.478 |    23.37 |
| 2048 | 128 | 16384 | 13.330 |   153.64 |  5.643 |    22.68 |

llama.cpp, CPU-only, Ryzen-7950X

|   PP |  TG |  N_KV |  T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|--------:|---------:|-------:|---------:|
| 2048 | 128 |     0 |  13.024 |   157.24 |  4.145 |    30.88 |
| 2048 | 128 |  2048 |  26.384 |    77.62 |  4.870 |    26.28 |
| 2048 | 128 |  4096 |  40.148 |    51.01 |  5.686 |    22.51 |
| 2048 | 128 |  6144 |  53.378 |    38.37 |  6.513 |    19.65 |
| 2048 | 128 |  8192 |  66.855 |    30.63 |  7.294 |    17.55 |
| 2048 | 128 | 10240 |  80.105 |    25.57 |  8.129 |    15.75 |
| 2048 | 128 | 12288 |  93.748 |    21.85 |  9.104 |    14.06 |
| 2048 | 128 | 14336 | 107.374 |    19.07 |  9.801 |    13.06 |
| 2048 | 128 | 16384 | 121.745 |    16.82 | 10.743 |    11.92 |
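The CPU gap is much larger, and widens with cache depth. From the first and last rows of the two Ryzen-7950X tables above (same sanity-check style as for the GPU numbers):

```python
# Speedups of ik_llama.cpp over llama.cpp on the Ryzen-7950X,
# computed from the two CPU tables above.
pp = {"N_KV=0": (862.94, 157.24), "N_KV=16384": (153.64, 16.82)}
tg = {"N_KV=0": (31.75, 30.88), "N_KV=16384": (22.68, 11.92)}

for name, table in (("PP", pp), ("TG", tg)):
    for kv, (ik, mainline) in table.items():
        print(f"{name} speedup at {kv}: {ik / mainline:.2f}x")
# PP is ~5.5x faster with an empty cache and ~9x at 16k tokens;
# TG grows from roughly parity to ~1.9x.
```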

@Nexesenex
Contributor

Gigachad! :D

Note: Rope cache doubles the perplexity when used.

@Nexesenex
Contributor

Also, quantized K cache might need to be adjusted:

If I use:

```
llama-perplexity -m GigaChat3-10B-A1.8B-Q8_0.gguf -mg 2 --override-kv deepseek2.expert_used_count=int:4 -c 512 -mqkv -gr -ctk q8_0 -ctv q8_0 --host 127.0.0.1 --port 8080 -f wiki.test.raw
```

I get:

```
perplexity: tokenizing the input ..
perplexity: tokenization took 325.438 ms
perplexity: calculating perplexity over 610 chunks, n_ctx=512, batch_size=2048, n_seq=4
CUDA error: invalid argument
current device: 0, in function ggml_backend_cuda_buffer_set_tensor at Q:\GitHub\ik_llama.cpp.fks\ggml\src\ggml-cuda.cu:557
cudaMemcpyAsync((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, ((cudaStream_t)0x2))
Q:\GitHub\ik_llama.cpp.fks\ggml\src\ggml-cuda.cu:123: CUDA error
```

@ubergarm
Contributor

@Nexesenex

I'm only testing on CPU, where I'm not seeing that error (which perhaps makes sense, given it looks like a CUDA-path issue).

I don't even know what -mg 2 is, but I tried various combinations of:

```
    -mg 2 \
    -mqkv \
    -ger \
    --override-kv deepseek2.expert_used_count=int:2 \
    -ctk q8_0 \
```

I didn't use -ctv q8_0 given this is MLA attention, but it seems okay. It does get "dumber" going down to only 2 experts, haha...

@Nexesenex
Contributor

Nexesenex commented Nov 21, 2025

@ubergarm: yeah, I tested with 3 experts and the PPL is not so bad. Time for inference now!

-mg sets the main GPU, a relic of early llama.cpp versions used to select the GPU for single-GPU inference, or the KV-cache destination with row split. I never knew whether anything else was meant by this option, so I still set it to my fastest GPU.

As for ctv, I always forget to remove it because it's either used or irrelevant. :D

@magikRUKKOLA

@Nexesenex

Note: Rope cache doubles the perplexity when used.

For the K2-Thinking too :)

@ikawrakow ikawrakow merged commit f119103 into main Nov 24, 2025

Development

Successfully merging this pull request may close these issues.

GigaChat3 models check_tensor_dims has wrong shape

4 participants